ABSTRACT


It is common to estimate a distribution by means of a step function.
Such estimates can be made continuous by connecting the left end points of
the steps with straight line segments. In this paper, the best
estimator of this class, in the sense of minimum risk, is found for
exponentially distributed data. Its risk is then compared with those
of the sample distribution function and the Pyke estimator.
I. INTRODUCTION


The motivation behind this thesis is to provide a contribution 
towards the objective of finding the best way to estimate an unknown 
continuous distribution. The cue is taken from References 4 and 5 
which have adopted the following principles: 

(1) The estimator itself should be a continuous function. 

(2) The estimator should be simple and natural. 

The first principle has not been generally adopted. The modern
textbook solution [Ref. 3] is to use the sample distribution (step)
function. A simple and natural extension of the step function,
satisfying both of the above principles, is to connect the left ends
of the steps with straight line segments, following the example of the
so-called ogive curve. Thus, the class of estimators can be described
by:

(a) plot the points $(X_i, C_i)$, $i = 1, \ldots, n$;

(b) connect these points with straight line segments,

where $0 < C_1 < C_2 < \cdots < C_n < 1$ are constants determined by some
rule and $X_1 < X_2 < \cdots < X_n$ are the order statistics of a random
sample of size n drawn from some unknown CDF, F. The only issue to be
resolved is how to determine the sequence $\{C_i\}$.
Reference 6 suggests using $C_i = i/(n+1)$, referring to the result as the
Pyke estimator, and shows that the expected squared error of a continuous
Pyke estimator for some distribution, F, is no larger than the expected
squared error of the sample distribution function for a sufficiently
large sample size. It is also shown that the Pyke estimator strictly
dominates the sample distribution function for sample sizes greater
than or equal to one. The risk function used is the integrated
expected squared error, a popular choice [Ref. 1].

In Ref. 4, the sequence $\{C_i\}$ for which the risk is minimized
is found for the case where F is the uniform distribution. Comparisons
of this optimal risk with that of the sample distribution function
and the Pyke estimator are made. The result is that the risk of the
Pyke estimator is significantly closer to the optimal than the risk of
the sample distribution function.

The purpose of this thesis is to determine if the risk of the Pyke
estimator remains closer to the optimal than that of the sample
distribution function for a random sample drawn from the exponential
distribution. The approach will be similar to that used in Ref. 4. In an
attempt to provide continuity for the reader, the notation of Ref. 4 is
used wherever possible.





II. DETERMINATION OF THE MINIMUM RISK COEFFICIENTS 


Let $C_0, C_1, \ldots, C_n, C_{n+1}$ be an increasing sequence with $C_0 = 0$
and $C_{n+1} = 1$. The continuous function estimator for F(x) can be
defined by:

$$H(x) = C_r + \frac{x - X_r}{X_{r+1} - X_r}\,(C_{r+1} - C_r), \qquad X_r \le x \le X_{r+1}, \quad r = 0, 1, \ldots, n \tag{2.1}$$

where $X_1 < X_2 < \cdots < X_n$ are the order statistics from an absolutely
continuous distribution F(x). Assume that the population is lower
bounded and contained in the interval $[0, \infty)$, then define $X_0 = 0$,
$X_{n+1} = \infty$ and the risk function by:

$$R = E \int_0^\infty \left[ F(x) - H(x) \right]^2 dF(x) \tag{2.2}$$
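As a concrete illustration of (2.1) and (2.2), the following minimal sketch builds the piecewise linear estimator for a given coefficient sequence and approximates its risk for unit-mean exponential data by simulation. The function names `make_estimator` and `monte_carlo_risk`, the midpoint grid, the default trial count, and the choice of the Pyke coefficients r/(n+1) as the example sequence are assumptions made for illustration; none of them is taken from the thesis.

```python
import bisect
import math
import random


def make_estimator(sample, coeffs):
    """Return H(x) of (2.1) for a sample and coefficients C_1..C_n.

    X_0 = 0 and C_0 = 0 are prepended. Beyond the largest observation the
    segment runs to X_{n+1} = infinity, so H(x) stays at C_n there.
    """
    xs = [0.0] + sorted(sample)
    cs = [0.0] + list(coeffs)

    def H(x):
        if x <= 0.0:
            return 0.0
        r = bisect.bisect_right(xs, x) - 1  # x lies in [X_r, X_{r+1})
        if r >= len(xs) - 1:
            return cs[-1]
        t = (x - xs[r]) / (xs[r + 1] - xs[r])
        return cs[r] + t * (cs[r + 1] - cs[r])

    return H


def monte_carlo_risk(n, coeffs, trials=4000, grid=100, seed=0):
    """Approximate R = E int (F - H)^2 dF for unit-mean exponential data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        H = make_estimator([rng.expovariate(1.0) for _ in range(n)], coeffs)
        # Substitute y = F(x), so dF(x) = dy on (0, 1); midpoint rule in y.
        for j in range(grid):
            y = (j + 0.5) / grid
            x = -math.log(1.0 - y)
            total += (y - H(x)) ** 2 / grid
    return total / trials


n = 5
pyke = [r / (n + 1) for r in range(1, n + 1)]  # assumed example coefficients
print(monte_carlo_risk(n, pyke))
```

Integrating in y = F(x) rather than in x keeps the weight dF(x) implicit and the grid finite, which is why the sketch never needs to truncate the upper limit of the integral.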


H(x) is defined piecewise according to which of the random intervals
$(X_r, X_{r+1})$ contains x. Assume that the sample is extracted from an
exponential distribution. For convenience, let the distribution's
mean equal one. Let u, v with u < v be the variables in the sample
space of $(X_r, X_{r+1})$. Their joint density function is:

$$f_{r,r+1}(u,v) = \frac{n!}{(r-1)!\,(n-r-1)!}\,(1 - e^{-u})^{r-1}\, e^{-u}\, e^{-v(n-r)} \tag{2.3}$$

for $0 < u < v < \infty$ and $1 \le r \le n-1$.


The value of the mean does not matter since the risk does not 
change with linear transformations of the basic random variable. 
(Personal communication from Professor R.R. Read) 
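The integrations in this section lean on a few standard facts about exponential order statistics, sketched here as an aid to the reader rather than as part of the original derivation. The symbols $D_r$ and $m$ are introduced for this note only. For a unit-mean exponential sample, the spacing $D_r = X_{r+1} - X_r$ is exponential with rate $m = n - r$ and independent of $X_r$, while $e^{-X_r}$ is distributed as the complement of the r-th uniform order statistic:

```latex
\begin{aligned}
E\left[e^{-X_r}\right] &= \frac{n-r+1}{n+1}, \\
E\left[e^{-D_r}\right] &= \int_0^\infty e^{-s}\, m e^{-ms}\, ds = \frac{m}{m+1}, \\
E\left[\frac{1 - e^{-D_r}}{D_r}\right] &= m \int_0^\infty \frac{e^{-ms} - e^{-(m+1)s}}{s}\, ds = m \ln\frac{m+1}{m}.
\end{aligned}
```

The last line is a Frullani integral; expectations of this kind are where logarithms of the form ln((n-r+1)/(n-r)) in the coefficient equations and risk formulas come from.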





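The end point densities can be derived as:

$$f_{0,1}(0,v) = n e^{-nv}, \qquad 0 < v < \infty \tag{2.4}$$

$$f_{n,n+1}(u,\infty) = n e^{-u} (1 - e^{-u})^{n-1}, \qquad 0 < u < \infty \tag{2.5}$$

Define $\Delta_r = C_{r+1} - C_r$. Then:

$$C_r = \sum_{j=0}^{r-1} \Delta_j$$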

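Rewriting (2.2):

$$R = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx$$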


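The minimum risk coefficients, $C_k$, can now be found using classical
optimization techniques. Using the Lagrangian form, the problem
becomes:

Minimize: $\phi = R - \lambda \left( \sum_{j=0}^{n} \Delta_j - 1 \right)$

Subject to: $\sum_{j=0}^{n} \Delta_j = 1$

where $\lambda$ is the Lagrange multiplier. Thus, the approach becomes:
Set $\partial\phi / \partial\Delta_k = 0$ for k = 0, 1, ..., n and solve for $C_k$.

$$\begin{aligned}
\frac{\partial \phi}{\partial \Delta_k} = {} & -2 \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_k - \frac{x-u}{v-u}\,\Delta_k \right] \frac{x-u}{v-u}\; e^{-x} f_{k,k+1}(u,v)\, du\, dv\, dx \\
& - 2 \sum_{r=k+1}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right] e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \;-\; \lambda \\
= {} & 0
\end{aligned} \tag{2.6}$$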





which after laborious but straightforward integration leads to: 


K+] 


C 2(n-k)(n-K41) - (2n-2k+1)(n-k+1) (n-k) In? | = Oye 





k 


(nk) (20-24 - 2(n-k)“(n-k+1) Ine | + x 





Cp (ners) (n=r) 10S 2 (n-rl) - a aftr 
Ey (n-r)| z ek) eee : bl a US (nek#1) 
(n-k) In oa (n-k) - 3 — > (n+1) (an) 





r=k+tl 
For k = n, (2.7) implies $\lambda = 0$ (the indeterminate form $-(n-k)\ln(n-k)$
is zero by use of (2.4) and (2.5)).
Using (2.5) in (2.6) for k = n provides:

$$\iint_{0 < u \le x < \infty} \left[ 1 - e^{-x} - C_n \right] e^{-x}\, n e^{-u} (1 - e^{-u})^{n-1}\, du\, dx = 0 \tag{2.8}$$

which implies:

$$C_n = \frac{n+1}{n+2}$$
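Assuming the integrand of (2.8) is $[1 - e^{-x} - C_n]\,e^{-x}$ weighted by the end point density (2.5), the value of $C_n$ can be checked in two steps. The shorthand $a = e^{-u}$ is introduced here only for this check; because u ranges over the sample maximum, a is distributed as the minimum of n uniform variables on (0, 1).

```latex
\begin{aligned}
\int_u^\infty \left[ 1 - C_n - e^{-x} \right] e^{-x}\, dx &= (1 - C_n)\, a - \tfrac{1}{2} a^2, \qquad a = e^{-u}, \\
E[a] = \frac{1}{n+1}, \qquad E[a^2] &= \frac{2}{(n+1)(n+2)}, \\
\frac{1 - C_n}{n+1} - \frac{1}{(n+1)(n+2)} = 0 \;&\Longrightarrow\; C_n = \frac{n+1}{n+2}.
\end{aligned}
```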


To simplify notation, let: 











Fy (k) = (n-k+1)(n-k)In ae - Ape OG AG Ge ee 
-kt+2 
In =a] 
F,(k) = 2(n-k )(n-k+1) - (2n-2k+1) (n- k+1)(n- aie n- ad 








; n-k+ n-k+2 
F(k) = (n-k+1)(n-k) n= = - (n-k+2) (n-k+1) Ind 


Wek) = (n-k+2)(n-k+1)(n-k), n-kt2 _ (n-k+1)(n-k) _ Tela Grea) 


4(n+2) n-k 2(n+2) 
n-k+1 - : eel Gee) 
Imaek * (mk) = Qa nae 2 





Equation (2.7) may be written in matrix form: 


AC = H (2.9) 
where: 
F,(1) F,(2) F(3) F..(n-1) F,(n) 
F,(1) F)(2) F.(3) ; F.(n-1) F(n) 
0 F,(2) F, (3) F..(n-1) F(n) 
0 0 F,(3) F.(n-1) F.(n) 
0 Vs Follm=I)t ealin) 
Ae 
0 0 0 see F,(n-1) F,(n) 
0 0 0 ae eae) F(n) 
0 0 0 ‘ae F,(n-1) F(n) 
C, H(0) 
Cy H(1) 
C= and H = 
C4] H(n) 


Successive row subtraction with the last row and the last column
discarded (since $C_n$ has already been determined) results in A
becoming a tri-diagonal matrix of the form:





G, (1) Go(1) 0 0 0 0 0 
G,(1) G, (2) G,(2) 0 0 0 0 
0 G,(2) G, (3) G,(3) 0 0 0 
0 0 G,(3) G, (4) 0 0 0 
0 0 0 Gp(4) 0 0 0 
a = 
0 0 0 0 G, (n-3) G,(n-3) 0 
0 0 0 0 G,(n-3) G,(n-2) Go(n-2) 
» 0 o 06 0 G,(n-2) G,(n-1) 
where: 
6. (k) = Fy(k) - F(k) 
G,(k) = F(k) ~ Fy (k) = Titer D, 
and 
H'(1) 
Ht = ae where H'(k) = H(k-1) - H(k). 
H'(n-1)| 
Equation (2.9) becomes:

$$A'C = H' \tag{2.91}$$


Equation (2.91) represents a 2nd order linear difference equation 
with variable coefficients. An explicit solution appears to be out 
of reach, forcing the use of numerical methods. 

This matrix equation can be solved by triangular decomposition for
any given sample size. The algorithm for this deterministic
solution and the associated Fortran subroutine, TRID, are described
in Ref. 2. Coefficients for various sample sizes were computed using
TRID and are shown with the corresponding Pyke coefficients in Table I.
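The TRID routine itself is documented in Ref. 2 and is not reproduced here. As a rough stand-in, the sketch below solves a general tridiagonal system by forward elimination followed by back substitution (often called the Thomas algorithm), which is one common way to carry out a triangular decomposition of such a matrix. The name `solve_tridiagonal` and the three-band argument layout are assumptions for illustration, and the sketch does no pivoting, so it presumes every pivot is nonzero.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting.

    lower: sub-diagonal entries (length n - 1)
    diag:  main diagonal entries (length n)
    upper: super-diagonal entries (length n - 1)
    rhs:   right-hand side (length n)
    """
    n = len(diag)
    if not (len(lower) == len(upper) == n - 1 and len(rhs) == n):
        raise ValueError("band lengths are inconsistent")

    c = [0.0] * n  # super-diagonal after elimination
    d = [0.0] * n  # right-hand side after elimination
    for i in range(n):
        a = lower[i - 1] if i > 0 else 0.0
        pivot = diag[i] - (a * c[i - 1] if i > 0 else 0.0)
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be needed")
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - (a * d[i - 1] if i > 0 else 0.0)) / pivot

    # Back substitution from the last unknown upward.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Small check: the exact solution of this system is (1, 2, 3).
print(solve_tridiagonal([1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0], [4.0, 8.0, 8.0]))
```

The work grows linearly with the number of unknowns, which is what makes it practical to repeat the solution for many sample sizes.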
III. CALCULATION OF THE RISKS


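From equation (2.2), the risk function for the minimum risk
coefficients, R(min), is:

$$R(\min) = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx$$

$$\begin{aligned}
= \sum_{r=0}^{n} \Bigg\{ & \frac{(1 - C_r)^2}{n+1} - \frac{2(1 - C_r)(n-r+1)}{(n+1)(n+2)} \\
& + \Delta_r\, \frac{(n-r+2)(n-r+1)(n-r)}{2(n+1)(n+2)} \left[ \ln\frac{n-r+2}{n-r} - \frac{2}{n-r+2} \right] \\
& + 2\Delta_r (C_r - 1)\, \frac{(n-r+1)(n-r)}{n+1} \left[ \ln\frac{n-r+1}{n-r} - \frac{1}{n-r+1} \right] \\
& + \Delta_r^2\, \frac{(n-r+1)(n-r)}{n+1} \left[ \frac{2n-2r+1}{n-r+1} - 2(n-r) \ln\frac{n-r+1}{n-r} \right] \\
& + \frac{(n-r+1)(n-r+2)}{(n+3)(n+2)(n+1)} \Bigg\}
\end{aligned} \tag{3.1}$$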
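The risk function for the Pyke estimator, R(Pyke), is:

$$R(\text{Pyke}) = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - \frac{r}{n+1} - \frac{x-u}{v-u}\,\frac{1}{n+1} \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx$$

$$\begin{aligned}
= \sum_{r=0}^{n} \Bigg[ & \frac{5(n-r)^2 + 5(n-r) + 1}{(n+1)^3} - \frac{(n-r+1)(3n-3r+2)}{(n+1)^2 (n+2)} + \frac{(n-r+1)(n-r+2)}{(n+1)(n+2)(n+3)} \\
& - \frac{2(n-r)(n-r+1)(2n-2r+1)}{(n+1)^3} \ln\frac{n-r+1}{n-r} + \frac{(n-r)(n-r+1)(n-r+2)}{2(n+1)^2 (n+2)} \ln\frac{n-r+2}{n-r} \Bigg]
\end{aligned} \tag{3.2}$$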


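The risk function for the sample distribution function, R(SDF), is:

$$R(\text{SDF}) = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - \frac{r}{n} \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx = \frac{1}{6n} \tag{3.3}$$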
IV. RESULTS AND CONCLUSIONS


Values of each type of risk were computed for various sample sizes
using (3.1), (3.2) and (3.3) and are shown in Table II and Figure 1,
where it becomes apparent that the risk of the Pyke estimator is
significantly closer to the optimum than is the risk of the sample
distribution function. All the risks converge to zero at the rate
1/n. Thus, n times the risk converges to a constant (1/6). It is
interesting to compare the risks as they approach this asymptote.
This is done in Table III and Figure 2.
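The approach to the asymptote can be reproduced directly from closed forms. The sketch below evaluates the Pyke risk as a sum over r = 0, ..., n of the elementary and logarithmic terms obtained by integrating the squared error with $C_r = r/(n+1)$, and uses the value 1/(6n) for the sample distribution function. The function names are hypothetical, the terms were re-derived for this illustration rather than copied from the source, and the products of the form (n-r) ln(...) are taken to be zero at r = n, in line with the treatment of that indeterminate form in Section II.

```python
import math


def risk_sdf(n):
    """Risk of the sample distribution function: 1/(6n)."""
    return 1.0 / (6.0 * n)


def risk_pyke(n):
    """Risk of the Pyke estimator, C_r = r/(n+1), for exponential data."""
    total = 0.0
    for r in range(n + 1):
        m = n - r
        # m * log(...) tends to 0 as m -> 0, so the logs are dropped at m = 0.
        log1 = math.log((m + 1) / m) if m > 0 else 0.0
        log2 = math.log((m + 2) / m) if m > 0 else 0.0
        total += (
            (5 * m * m + 5 * m + 1) / (n + 1) ** 3
            - (m + 1) * (3 * m + 2) / ((n + 1) ** 2 * (n + 2))
            + (m + 1) * (m + 2) / ((n + 1) * (n + 2) * (n + 3))
            - 2 * m * (m + 1) * (2 * m + 1) * log1 / (n + 1) ** 3
            + m * (m + 1) * (m + 2) * log2 / (2 * (n + 1) ** 2 * (n + 2))
        )
    return total


# n times the risk for each estimator; both columns approach 1/6.
for n in (1, 5, 10, 50, 500):
    print(n, n * risk_pyke(n), n * risk_sdf(n))
```

Scaling by n removes the common 1/n decay, so differences between the estimators show up as differences in how quickly each column settles near 1/6.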

These results coupled with those of Ref. 4 suggest that, given
the criterion of minimizing expected squared error, the Pyke estimator
should be used in lieu of the sample distribution function,
particularly if the underlying population is suspected to be either
exponential or uniform.
VALUE OF RISK FUNCTIONS FOR SAMPLE SIZE N
N(MINIMUM RISK)